Sains Malaysiana 52(8)(2023): 2431-2451
http://doi.org/10.17576/jsm-2023-5208-19
A New Single Linkage Robust Clustering Outlier Detection Procedures for Multivariate Data
(Suatu Prosedur Baharu Pengesanan Data Terpencil Berasaskan Pengelompokan Rangkaian Tunggal Teguh bagi Data Multivariat)
SHARIFAH SAKINAH SYED ABD MUTALIB1,2, SITI ZANARIAH SATARI1,* & WAN NUR SYAHIDAH WAN YUSOFF1
1Centre for Mathematical Sciences, College of Computing and Applied Sciences, Universiti Malaysia Pahang, Lebuhraya Tun Razak, 26300 Gambang, Kuantan, Pahang, Malaysia
2Faculty of Computer, Media and Technology Management, University College TATI, Jalan Panchur, Telok Kalong, 24000 Kemaman, Terengganu, Malaysia
Received: 3 January 2023/Accepted: 1 August 2023
Abstract
Outliers are abnormal observations, and detecting them in multivariate data has long been of interest. Unlike univariate data, multivariate data cannot be screened for outliers by visual inspection alone. In this study, we developed a new single linkage robust clustering outlier detection procedure for multivariate data. A robust estimator, Test on Covariance (TOC), is used to robustify the similarity distance measure, producing a robust single linkage clustering. The performance of the new procedure is investigated via a simulation study using three outlier scenarios, with historical multivariate datasets as illustrative examples. Three performance measures are used: pout, pmask, and pswamp. The new procedure is also compared with single linkage clustering using the Euclidean and Mahalanobis distances as similarity distance measures, as well as with TOC. The new procedure performs well in Outlier Scenario 3, in which both the mean and the covariance matrix are shifted. On the historical datasets, it detects all outliers with no masking effect in two out of five datasets and shows no swamping effect in any dataset. In conclusion, the new single linkage robust clustering outlier detection procedure is a practical and promising approach for simultaneously identifying multiple outliers in multivariate data.
Keywords: Multivariate data; outliers; single linkage clustering; Test on Covariance; robust clustering
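The kind of procedure summarized in the abstract can be illustrated with a short Python sketch: observations are clustered by single linkage on a Mahalanobis-type distance computed from a robust scatter estimate, and everything outside the largest cluster is flagged. This is a minimal sketch under stated assumptions, not the authors' procedure. The TOC estimator is not specified in this abstract, so `crude_robust_covariance` is a hypothetical stand-in; the cut height, the rule that the largest cluster is the inlier group, and the simulated data are also assumptions made only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def crude_robust_covariance(X):
    """Hypothetical stand-in for a robust scatter estimate (NOT TOC).

    Takes the covariance of the half of the observations closest to the
    coordinate-wise median, so gross outliers do not inflate the estimate.
    """
    center = np.median(X, axis=0)
    dist_to_center = np.linalg.norm(X - center, axis=1)
    h = X.shape[0] // 2 + 1
    core = X[np.argsort(dist_to_center)[:h]]
    return np.cov(core, rowvar=False)


def single_linkage_outliers(X, cut_height):
    """Return indices of observations outside the largest cluster."""
    VI = np.linalg.inv(crude_robust_covariance(X))
    # Pairwise Mahalanobis-type distances under the robust scatter matrix.
    d = pdist(X, metric="mahalanobis", VI=VI)
    # Single linkage tree, cut at an assumed (hypothetical) height.
    tree = linkage(d, method="single")
    labels = fcluster(tree, t=cut_height, criterion="distance")
    # Assumption: the largest cluster holds the inliers.
    main_label = np.bincount(labels).argmax()
    return np.flatnonzero(labels != main_label)


# Simulated data (assumption): 50 clean points plus 5 mean-shifted outliers.
rng = np.random.default_rng(0)
clean = rng.normal(size=(50, 3))
planted = rng.normal(loc=15.0, size=(5, 3))
X = np.vstack([clean, planted])

flagged = single_linkage_outliers(X, cut_height=6.0)
print("flagged indices:", flagged)
```

Replacing `crude_robust_covariance` with another robust estimator changes only the distance matrix; the clustering and flagging steps stay the same. Masking would show up here as planted indices (50 to 54) missing from `flagged`, and swamping as clean indices (0 to 49) appearing in it.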
Abstrak
Data terpencil ialah data tidak normal dan pengesanan data terpencil untuk data multivariat sentiasa menjana minat. Tidak seperti data univariat, pengesanan data terpencil untuk data multivariat tidak mencukupi dengan pemeriksaan visual. Dalam kajian ini, kami membangunkan satu prosedur baru pengesanan data terpencil berasaskan pengelompokan rangkaian tunggal teguh bagi data multivariat. Penganggar teguh, Test on Covariance (TOC) digunakan untuk meneguhkan ukuran jarak persamaan, menghasilkan pengelompokan rangkaian tunggal teguh. Prestasi prosedur baru pengesanan data terpencil berasaskan pengelompokan rangkaian tunggal teguh disiasat melalui kajian simulasi menggunakan tiga senario data terpencil dan set data sedia ada multivariat sebagai contoh ilustrasi. Tiga ukuran prestasi digunakan, iaitu pout, pmask dan pswamp. Prestasi prosedur baru pengesanan data terpencil berasaskan pengelompokan rangkaian tunggal teguh juga dibandingkan dengan pengelompokan rangkaian tunggal menggunakan jarak Euclidean dan Mahalanobis sebagai ukuran jarak persamaan beserta TOC. Didapati bahawa prosedur baru pengesanan data terpencil berasaskan pengelompokan rangkaian tunggal teguh berprestasi baik dalam Senario Data Terpencil 3 apabila min dan matriks kovarians dianjakkan. Prosedur baru juga berfungsi dengan baik apabila berjaya mengesan semua data terpencil dan tidak mempunyai kesan masking dalam 2 daripada 5 set data dan tidak mempunyai kesan swamping dalam semua set data. Kesimpulannya, prosedur baru pengesanan data terpencil berasaskan pengelompokan rangkaian tunggal teguh ialah pendekatan yang praktikal dan menjanjikan, serta bagus untuk mengesan data terpencil yang berkelompok secara serentak dalam data multivariat.
Kata kunci: Data multivariat; data terpencil; pengelompokan rangkaian tunggal; pengelompokan teguh; Test on Covariance
*Corresponding author; email: sharifahsakinah84@gmail.com